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Abstract 

In this paper, we extend to generalized linear models (including logistic and other binary 
regression models, Poisson regression and gamma regression models) the robust model selection 
methodology developed by Miiller and Welsh (2005) for linear regression models. As in Miiller 
and Welsh (2005), we combine a robust penalized measure of fit to the sample with a robust 
measure of out of sample predictive ability which is estimated using a post-stratified m-out-of-n 
bootstrap. A key idea is that the method can be used to compare different estimators (robust 
and nonrobust) as well as different models. Even when specialized back to linear regression 
models, the methodology presented in this paper improves on that of Miiller and Welsh (2005). 
In particular, we use a new bias-adjusted bootstrap estimator which avoids the need to centre the 
explanatory variables and to include an intercept in every model. We also use more sophisticated 
arguments than Miiller and Welsh (2005) to establish an essential monotonicity condition. 
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1 Introduction 



Model selection is fundamental to the practical application of statistics and there is a substantial 
literature on the selection of linear regression models. A growing part of this literature is con- 
cerned with robust approaches to selecting linear regression models: see Miiller and Welsh (2005) 
for references. The literature on the selection of generalized linear models (GLM; McCullagh and 
Nelder, 1989) and the related marginal models fitted by generalized estimating equations (GEE; 
Liang and Zeger, 1986) - though both are widely used - is much smaller and has only recently in- 
corporated robustness considerations. Hurvich and Tsai (1995) and Pan (2001) developed Akaike 
information criterion (AIC) like criteria based on the quasi-likelihood, Cantoni, Mills Flemming, 
and Ronchetti (2005) presented a generalized version of Mallows' Cp, and Pan and Le (2001) and 
Cantoni et al. (2007) presented approaches based on the bootstrap and cross-validation, respec- 
tively. Our purpose in this paper is to generalize the robust bootstrap model selection criterion 
of Miiller and Welsh (2005) to generalized linear models. 

The extension of the methodology of Miiller and Welsh (2005) from linear regression to 
generalized linear models is less straightforward than we expected and, as a result, the present 
paper differs from Miiller and Welsh (2005) in two important respects. First, the bias-adjusted m- 
out-of-n bootstrap estimator P^m~^*iPa'Tn~ Pa) rather than the m-out-of-n bootstrap estimator 

^ (2) 

Pa',m is used in estimating the expected prediction loss Mn (a) (definitions are given in Section 
2). As discussed in more detail in Section 3.2, this achieves the same purpose but avoids the 
centering of the explanatory variables and the requirement that we include an intercept in every 
model used in Miiller and Welsh (2005). Second, we present a simpler, more general method 
than that used in Miiller and Welsh (2005) for showing that the consistency result applies to 
particular robust estimators of the regression parameter. As discussed in Section 3.3, we use 
generalized inverse matrices to decompose the asymptotic variance of the estimator into terms 
which are easier to handle, write the critical trace term as a simple sum and then show that the 
terms in this sum have the required properties. Both of these changes were necessitated by the 
more complicated structure of generalized linear models but they also apply to regression models 
where they represent improvements to the methodology of Miiller and Welsh (2005). 

Suppose that we have n independent observations y = (yi, . . . , Un)'^ aud an n x p matrix X 
whose columns we index by {1, . . . ,p}. Let a denote any subset of pa distinct elements from 
{1, . . . ,p} and let Xa denote the n x pa matrix with columns given by the columns of X whose 
indices appear in a. Let denote the ith row of Xa- Then a generalized linear regression model 
a for the relationship between the response variable y and explanatory variables X is specified 
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by 

^yi = KVi), Varyj = with ?7i = x^j/3„, i = l,...,n, (1) 

where (3^ is an unknown p^- vector of regression parameters. Here h is the inverse of the usual hnk 
function and, for simphcity, we have reduced notation by absorbing h into the variance function 
V. Both h and v are assumed known. Let A denote a set of generahzed hnear regression models 
for the relationship between y and X. The purpose of model selection is to choose one or more 
models a from A with specified desirable properties. 

Our perspective on model selection is that a useful model should (i) parsimoniously describe 
the relationship between the sample data y and X and (ii) be able to predict independent new 
observations. The ability to parsimoniously describe the relationship between the sample data 
can be measured by applying a penalised loss function to the observed residuals and we use 
the expected variance-weighted prediction loss to measure the ability to predict new observa- 
tions. In addition, we encourage the consideration of different types of estimator of each of the 
models. Possible estimators include the nonrobust maximum likelihood (see Kiinsch, Stefanski, 
and Carroll, 1989; Cantoni and Ronchetti, 2001; Ruckstuhl and Welsh, 2001) and the maxi- 
mum quasi-likelihood estimators (see McCuUagh and Nelder, 1989) and the robust estimators 
of Preisser and Qaqish (1999), Cantoni and Ronchetti (2001), and Cantoni (2004). The Cantoni 
and Ronchetti (2001) estimator is described in Section 3.3. 

We define a class of robust model selection criteria in Section 2, present our theoretical results 
in Section 3, report the results of a simulation study in Section 4, present a real data example in 
Section 5, and conclude with a short discussion and some brief remarks in Section 6. 



2 Robust model selection criterion 

Let denote an estimator of type c G C of /3q, under (1), let o" be a scale parameter, let p 

be a nonnegative loss function, let 5 be a specified function of the sample size n, let a denote 

a measure of spread of the data, and let y be a vector of future observations at X which are 

independent of y. Then, we choose models a from a set A for which the criterion function 

2 ( " 

M(a) = —\ -E^Y^Wo^iPliVi - KxLl^a)}l<^v{m)]+^{n)Pa 



i=l 



i=l 

is small. In practice, we often supplement this criterion with graphical diagnostic methods which 
further explore the quality of the model in ways that are not amenable to simple mathematical 



description. 

As in Miiller and Welsh (2005) we separate the estimators and p because in practice we 
want to compare different estimators indexed by c € C and Unking p to any one of these estimators 
may excessively favour that estimator. We adopt the view that we are interested in fitting the 
core data and predicting core observations rather than those in the tail of the distribution so take 
p to be constant for sufficiently large The simplest example of such a function (and the one 
we use in our simulations) is the function which is quadratic near the origin and constant away 
from the origin as in 

p(z) = min(z2,62)^ (3) 

Following Miiller and Welsh (2005), we choose 6 = 2. Smoother versions of p such as are required 
in our theoretical results are easily defined and we can, when appropriate to the problem, use 
asymmetric p functions. The weights Wai are Mallows' type weights which may be included for 
robustness in the X space but can and often will be constant. The only restrictions on the 
function 6 are that 5{n) 00 and 5{n)/n ^ as n ^ 00. A common choice is 5{n) = k\og{n) 
for A; > where we choose k = 2 (e.g. Schwarz, 1978; Miiller and Welsh, 2005). If the criterion 
were based on the penalized loss function alone then 6 would have to be of order higher than 
O(loglogn) as shown in Qian and Field (2002, Theorem 1-3) for logistic regression models. 

Let /3q be an estimator of type c of the model a, and if a has to be estimated, we estimate it 
from the Pearson residuals {yj — h{x'^^.^f3^^)}/v{x'^^^(3^^.), i = 1, . . . ,n, from a "full" model aj. A 
"full" model is a large model (often assumed to be the model {1, . . . ,p}) which produces a valid 
measure of residual spread (but hopefully not so large that we incur a high cost from overfitting) . 
We omit the subscript a / and denote the estimator of o" by ? for notational simplicity. Then we 
estimate the penalized in-sample term in the criterion function (2) by a'^'^{Mn\c() + n~^6{n)pa}, 
where 

Next, we implement a proportionally allocated, stratified m-out-of-n bootstrap of rows of (y, X) 
in which we (i) compute and order the Pearson residuals, (ii) set the number of strata K at 
between 3 and 8 depending on the sample size 7i, (iii) set stratum boundaries at the 

K-\2K-\...,(Er- l)i^~^ 

quantiles of the Pearson residuals, (iv) allocate observations to the strata in which the Pearson 
residuals lie, (v) sample (observations in stratum k)m/n (rounded as necessary) rows of (y, X) 
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independently with replacement from stratum k so that the total sample size is m, (vi) use these 
data to construct the estimator repeat steps (v) and (vi) B independent times and then 

estimate the conditional expected prediction loss by a'^'^Mn\a), where 

where denotes expectation with respect to the bootstrap distribution. In practice, it seems 
useful to take m to be between 25 — 50% of the sample size n if working with moderate sample 
sizes, e.g. 50 < n < 200. If n is small then m is small and the parameter estimators in the 
bootstrap do not converge for some bootstrap samples though this typically occurs less often 
with the stratified bootstrap. If n is large then m can be smaller than 25% of the sample size n. 
Combining (4) and (5), we estimate the criterion function (2) by 

The use of the stratified bootstrap ensures that we obtain bootstrap samples which are similar 
to the sample data in the sense that observations in the tails of the residual distribution and 
outliers are represented in each bootstrap sample or, with categorical data, each category is 
represented in the bootstrap samples. In essence, we construct an estimate of the conditional 
expected prediction loss based on samples which are similar to the sample we have observed. 
The estimated variance function is estimated from a "full" model so does not change with the 
model a. This simplifies the procedure and has the advantage of making the procedure more 
stable. Finally, we use the bias-adjusted bootstrap estimator /3^*m ~ '^*{fia,m ~ i^a) rather than 
the bootstrap estimator in M„ (a). As discussed in more detail below, this achieves the 
same purpose as but avoids the centering technique used in Miiller and Welsh (2005) and means 
that we do not have to include an intercept in every model. It is therefore a useful refinement of 
the criterion given in Miiller and Welsh (2005). 

The computational burden of model selection can be reduced by limiting the number of 
different estimators we consider, reducing their computation by, for example, using good starting 
values from the initial fit to the data, and by reducing the number of models in A. Generally, our 
approach is to use an eclectic mix of methods including robust versions of deviance-tests, search 
schemes, diagnostics etc to produce a relatively small set A of competing models which we then 
compare using the methodology presented in this paper. In particular, we present a backward 
model search algorithm in Section 3.4 that substantially reduces the number of models to be 
considered while maintaining the consistency of M„(a). 
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3 Theoretical results 



Our procedure is intended to identify useful models whether or not a true model exists and our 
interest is not restricted to a single best model but to the identification of useful models (which 
make M„(a) small). In this context, if (i) a true model oq exists and (ii) oq Q {!,..., p}, 
then consistency in the sense that a procedure identifies ao with probability tending to one is a 
desirable property. Although in practice, we are interested in all the models which make M„(a) 
small, we focus in this section on the model which minimises Mn{a) and show that choosing this 
model is consistent. Specifically, for c G C, we define 

= argminM„(a), (6) 

and develop conditions under which for each c € C, 



As in Miiller and Welsh (2005), we define the subset of correct models Ac in A to be the 
set of models a (z A such that ao Q a; all other models are called incorrect models. For any 
correct model a G Ac, the errors eai = yi — h{x^^l3a) satisfy eai = eagj, for i = 1, • • • , n, and the 
components of [ia corresponding to columns of Xa which are not also in uq equal zero. 

3.1 Conditions 

It is convenient to introduce a simplified notation for stating the conditions and simplifying the 
proof of the main result. Write 




ao} = 1- 



(7) 




(Ti — (Jv{x^^j^l3aj\ f-aQi 



tpi = '4){ei/ai), and Vi = i^'^^ijoi). 



Then we require the following conditions. 



(i) The "Pa X Vci matrix 



1 



n 



2re 




where Vi, is of full rank. 
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(ii) For all models a A (including the full model), the estimators — /3a = Op{n ^/^), 

— a = Op{n~^/'^) with a > 0. For all correct models a G Ac, 

nVar(^^) = Ea + Op(l), 

where Sq, is of full rank. 

(iii) For all models a G A, the bootstrap estimator (3^^ — > (3^ in probability. For all correct 
models a G Ac, 

mVar.iPZ,) = Var(^^) + Op(l) 
and for any two correct models ai, 02 € Ac such that ai C 02 

trace(Sa2rQ,2) — trace(SQ,irQj) > 0. (8) 

(iv) The sequence 6{n) = o{n/m) and m = o{n). 

(v) The derivatives ip = p' and ip' exist, are uniformly continuous, bounded, Var(ejV'j) < 00, 
and E^'(ej) > 0, i = 1, . . . , n. 

(vi) The weights are bounded, h and its first two derivatives are continuous, a and v are both 
positive, and v' is bounded. 

(vii) The Xi are bounded. 

(viii) For any incorrect model a, 

liminfMW(Q) > Hm M^^\ao) a.s. 

n^oo n— >oo 

Condition (i) is a generalization of a standard condition for fitting regression models which we 
require for generalized linear models. Condition (ii) is satisfied by many estimators; condition 
(8) restricts the estimators we can consider in C but allows us to include maximum likelihood 
and other estimators such as the Cantoni and Ronchetti (2001) estimator. We refer to (8) as 
the monotonicity condition. Condition (iii) specifies the required properties of the bootstrap 
parameter estimator. In contrast to Miiller and Welsh (2005), we have adjusted the bootstrap 
estimator so we do not have to impose conditions on the asymptotic bias of the bootstrap estima- 
tor. Combining conditions (ii) and (iii), we obtain Var*(/3^'^) = m~^K^Yjct + Op{m~^). Conditions 
(v)-(vii) enables us to make various two-term Taylor expansions and to control the remainder 
terms. We require a higher level of smoothness than exhibited by the p-function (3) but there 
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are many functions satisfying these properties. Condition (viii) is a generalisation of Condition 
(C4) of Shao (1996) to allow a more general choice of /)(•). 

We have specified a simple set of sufficient conditions (particularly in conditions (v)-(vii)) 
which are appropriate for a robust p function and generalized linear models. However, we note 
that we can specify alternative conditions and simpler conditions for particular cases. For ex- 
ample, we obtain alternative conditions if we allow the xi to be stochastic; see for example Shao 
(1996, Condition C3. b.). We can simplify our conditions if we use the nonrobust function 
p{x) = x^; again see Shao (1996, p661). Even in the robust case, simpler conditions can be given 
for homoscedastic linear models because h{x) = x, v{x) = 1. These possibilities are somewhat 
tangential to our main purpose so we will not pursue them in this paper. 

Theorem 3.1. Under the above conditions, the consistency result (7) holds. 

Proof of Theorem 3.1. 

The proof of this result is similar to that given in Miiller and Welsh (2005). The main term 
we need to deal with is the bootstrap term 

M(2)(a) = iE,f;Tz;«,/,|(y,-/i[x^,{^;^-E,(^;^-^„)}])/aA, 

where (3a and ai = av{x'^^^Paf) are constant with respect to the bootstrap. We make a Taylor 
expansion of p as a function of /?* ^ — E*(/?* ^ — (3a) about Pa, to obtain 

1 " 

Mf)(a) = -y2wa^p[{y^-h{xlM}/^^] 

1 " ^ ^ ^ 

+ E=K — O- WaiX^^{(3^ ,i^ — E^, (3a^m)iPa,m ~ E* Pa,m) ^ai 
i=l 

X {h'{xliPafi;'[{y, - h{xl,Pa)}/ai] - h"ixl,Pam{yi - h{xl,Pa)}/^i]) 
= T1+T2, 

where \(3a — < \Pam. ~ Pa\- This equation is analogous to (9) in Miiller and Welsh (2005) 
except that we have eliminated the linear term by using the bias-adjusted bootstrap estimator. 
We consider Ti and T2 in turn. 
Order of T2: Let 

Hai = h'{xl,pafi;'[{yi - h{xlM}/^^] " h" {xlMi'[{y^ - h{xlM}/a,] 

= h'{xl,pafi;'[{ei + h{xlM - h{xlM}/^i] - h"{x^,Pam{ei + h{xl,(3a) - h{xl,pa)}/ai] 
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and write 



1 " _ ^ _ 

i=l 

1 " ^ , 

i=l 

1 " ^ , 

i=l 

1 " ^ ^ ^ 



Then 



1 " ^ , 

— ^ Var*(/3*^^)x„i(/i^^i E^/j- - /i^i E V'i) 

trace |var*(^; „,) ^ (T,"^u;aiX„i2;^i(/i|^2i E - <i E Vi) 



2n 
2m 



trace(SQ,rQ,) + Op(m ^) 



by condition (iii) and the first part of condition (i). Similarly 
1 " ^ , 



i=l 



by condition (iii) and the second part of condition (i). Finally, 



^*w-^WaixliiP*^^-E^P*^^)0^^„,-E^P*^^^ ^Hai-a^ ^h'^iil^'i+a- = Op{m ^) 



2n 

i=l 

provided 



max sup sup h' {x^itf ij)' [{e + h{xiiPa) - h{xiit)}/ai] - ^K\xi^iiia)i^' {tj Oi)\ = Op(l) 



T . 



-2t'2/ T a 



l<i<n 



and 



|t-/3c<|<n-l/2c 



max sup sup 

l<i<n 



\a-'h"{xi^)^[{e + h{x'M - h{xi,t)]/di\ - a-'h"{xi,Pame/ai)\ = Op{l). 

\t-l3a\<n-'^/^C 

Conditions (v)-(viii) ensure that these requirements hold. 

Order of Ti: Let — < \Pa — Pa\, IPaj — Paj \ < IPaj — Paf I and — o"| < |5 — (t|. Recall 
that fjj = (Tv[haji) and write 



D{yi, hai,ai 



-Xaih'^iCF^'^ip{{yi - hai)/ai} 
-a~'^v{h~^{hafi)){yi - hai)ip{{yi - hai)/ai] 

\ -Xaji(Tl'^CFv'{h~^{hafi)){yi - hai)lp{iyi - hai) / di} J 
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Then 

^ n \ ^ 1 " 

-^^WaiPiiVi - hai)/ai} = -'^Waip{ei/(Ti) + - '^Wai{Pa - Pa,'^ - (^,Paf - PofV D{yi, hai, (7i) 
i=l i=l i=l 

1 " ^ - ~ _ 

H '^Waiil3a - f3a,'^ - Cr, - PafV {DiVi, hai, CTi) - D{yi, hai, CFi)} 



n . 

1=1 



1 " 



provided 



max sup sup \a- ^ h' {x'^^t)^p[{€ + /i(x^/3q,) - /i(x^it)}/ai] - cr- ^/i^j^leM)! = Op{l) 



l<i<n e \t-f3^\<n-^/^C 



max sup sup \a ) v {x^ iPaf){e + h{x^iPa) - h{x^it)} 



max sup sup \a '^v{x^ j3af) ^ {e + h{x1^^(3a) - h{x1^it}^[{e + h{x^i[3a) 



~h{xl^it)] /di] - a ^v{xl^if3af) ^eij{e/ai)\ = Op{l). 



As for T2, these results follow from conditions (v)-(vii). 
Putting both terms together, it follows that 



1 , \ — 

M^^\a) = -y^WaipiiVi - h{x'^iPa))/'^i} + —traceiT^aTa) + Op{m~^) (9) 
and the proof is completed as in Miiller and Welsh (2005). □ 



3.2 The elimination of bias 

One of the main difficulties in constructing model selection criteria like M„(a) is removing the 

(2) 

bias (equivalently the linear term) in the expansion of M„ (a). Suppose that instead of the 

bias-adjusted bootstrap estimator — E,,(/3^*^ — f3^), we use the bootstrap estimator /?^*^ 

(2) (2) 
in Mn (a). Then when we expand Mn (a) as in Shao (1996), Miiller and Welsh (2005) or the 

proof of Theorem 3.1, we obtain the linear term 

^ ^ 1 " _ 

E*(/?a,m - Pa)'^ -'^'^^^WaiXaih'{xliPa)lp{{yi - h{xliPa)) f^i} ■ (10) 
i=l 

As shown in Miiller and Welsh (2005), the bias term E=i<(/3* m ~ Pa) is typically a function of a 
with leading term Op(m~^), the same as the quadratic term in the expansion. Since the quadratic 
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term governs the selection of correct models, it is crucial that the linear term be at least of smaller 
order. 

There are various ways to make (10) of order Op(m~^). Notice that ordinarily the mean in 
(10) is asymptotic to 

1 

i=l 

which is 0(1). However, if E^/^j = 0, then it can be Op(n~^/^) which can be made Op{m~^). This 
is the approach used in Shao (1996). It holds when ^(x) = x but this is a nonrobust choice 
and hence unappealing in general. Miiller and Welsh (2005) took a different approach in which 
they insisted that each model contain an intercept and then centered the explanatory variables 
so that they have mean zero and the bias is forced into the intercept. In fact, the intercept can 
be eliminated by replacing the intercept of the bootstrap estimator by that of the estimator /3q, 
or by fixing the intercepts at the value of the intercept estimated under a "fuH" model. This 
approach is much less attractive in the present more general context because the centering vector 
has to include estimates of di and E-i/'j (which previously did not depend on i) and /i^j (which 
was previously not present). This means that the centering vector is stochastic and the centered 
explanatory variables cannot simply be conditioned on. Even if we overcome these difficulties, 
we have to ensure that the criterion is consistent and the arguments given in the next subsection 
do not apply unless the model is fitted with the same covariates as the model selection criterion 
uses. This approach is not therefore very attractive. 

A different approach would be to require as in Miiller and Welsh (2005) that E^,(/3* m~ Pa) = 
m~^Ba + Op{m~^), estimate Ba and then adjust the criterion by subtracting off an estimate of 
(10). Although this will remove the bias, it will add a contribution to the quadratic term which 
will affect the arguments in the next subsection. Also, it changes the criterion which then loses 
its natural interpretability. It is far better to think in terms of adjusting the bootstrap estimator 
/?* „ for bias. We could do this by focussing on Ba (as we only need the leading term) but then 
we would need to derive and estimate B^ for each estimator we consider. Fortunately, we have 
available the bias itself in the very natural form E*(/?* ^ — /?„) and so we can remove the bias 
entirely without having to assume any particular form. This is the solution that we have adopted 

(2) 

in using the bias adjusted bootstrap estimator in Mn (a). 
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3.3 The monotonicity of trace(EQ^rQ,) 

The assumption (iii) that trace{Y^ara) is monotone in pa does not hold in general for arbitrary 
positive semi-definit matrices Sq, and Ta- For example, with 





-0.5 1 




^.0 


0.2\ 


r 










\ -0.5 


1.0) 




10.2 


O.lj 



Sq,2 — I I , — I I , Sq,j^ — 1.0, r^j^ — 1.0, 

and we find that 

tT:ace{T,a2^a2) ~ trace(SQ,^rQj) = 0.9 — 1.0 = —0.1. 

However, Miiller and Welsh (2005) prove that for linear regression models, the condition holds 
for the class of Mallows type M-estimators or one-step Mallows type M-estimators etc., because 
of the relationship between Var(/3Q) and Tq. 

Consider the maximum likelihood estimator for generalized linear models. We can write 
condition (i) as 

where Wr^ = \ diag(crf E V'i - /i^^ E-0i), . . . , a~'^Wan{h!an^'4^'n - EV'n))- From Mc- 

Cullagh and Nelder (1989, p43), the maximum likelihood estimator [3a satisfies 

n Var(^„) = {X^Wj^M"^ + Op(l), 

where W^.^ = diag(/i^^]^/cjf , . . . , h'^^/a^). We have to show that 

tvace^iX^W^M^'X^Wr^Xa} 

is strictly monotone increasing in pa- 

Reorder the rows of X^ if necessary so that the top pa x p^ submatrix Cq, is nonsingular. 
Then the pa x n matrix X^ = {Ca^,0) is a generalized inverse of Xa- Then we have that 

trace (^X-W^^X-^X^Wr^Xa) = trace [x^X-W^^X-^ X^Wr^) 

= trace (XaX-W^^X^X-Wr^ ) . 

By definition of the generalized inverse, XaX~ is a symmetric nxn matrix with first pa diagonal 
elements equal to +1 and the remaining elements zero so that XaX~ = diag(l, . . . , 1, 0, . . . , 0). 
Therefore, 



-1 Z'" 

trace {{X]^Wj:M-'X^Wr^Xa} = ^Yl 
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The simplest sufficient condition for monotonicity is 

/il2.EV->CEV'i, i = l,...,n. (11) 

Since the left hand side is positive, it suffices to show that h'^^ Ei/^j < for i = 1, . . . ,n. 

The monotonicity condition (11) holds if E-i/^j = or /i^. = 0. The first case occurs when (i) 
p{x) = or (ii) the ei = Ui — h^i has a distribution which is symmetric about zero and ip is 
antisymmetric and the second when we use the identity link so h{x) = x. Shao (1996) exploited 

(i) but this choice favours least squares estimation and is non-robust so we prefer not to use it; 

(ii) applies to Gaussian models but not to models with asymmetric distributions. Similarly, the 
identity link is widely used in Gaussian models and may be used in gamma models but is not 
useful in binomial and Poisson models. In these cases, we need to examine (11) more carefully. 
For the log link which is often used in Poisson and gamma models 

K^ai) = h'{r]ai) = h"{r]ai) = exp(ry„i) > 

and for the reciprocal link which is often used in gamma models 

1 12 
Hvai) = , h'{r]ai) = 2"' h"{rjai) = — > 0. 

"Hai Vai 

However, for many right skewed distributions like the Poisson and gamma, anti-symmetric ip 
functions with sufficiently large b truncate more of the upper tail than the lower tail so E-f/^j < 0. 
To see this, note that for the ^ function (3), we can write 



/oo 
7p{z)dF{az + fi) 

= I 2zdF{az + fi) 

J — min(fe,/x/o-) 

/— min{b,/i/(T) poo 
2zdF{az + 2zdF{az + n) 

'Uulcr Jh 



< 0, 

provided h is large enough to ensure that XT^'ji"^^'^/"^ zdF{az + /i) + zdF{az + /x) > 0. It 
follows that h^^ EV'i < and (11) holds in these cases. For the logistic link 

exp(r?ai) . exp(r?aj) exp(r?„j) - exp(2r?„j) 

Kriai) = — -, r, h [riai) = -— -, rr^, h [rjai) - 



1 exp(7?Q,j) ' (1 -hexp(??Q,j))2' (1 exp(r/„i))3 

so that hai < 1/2, h^.^ > if t?q,. < and h^i > 1/2, h^^ < if r]a.^ > and we need a more 
careful analysis. The Bernoulli model can be left or right skewed depending on the value of hai 
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so E^j can be positive or negative. Fortunately, for anti-symmetric V'j 



= i^{-KJh'J{i - + - KJ/h'^JK^ 
= -ij{KJh'^^){i - KJ + V{(1 - h^J/h'^Jh^^ 

from which Eipi < if 77^. < and Eipi > if 77^. > so that h'^- Eipi < and (11) holds. 

Next, we consider the quasi-likelihood estimator for the logistic model as defined in Cantoni 
and Ronchetti (2001, Section 2.2). The Mallows quasi-likehhood estimator is the solution of the 
estimating equations, 

1 



E 



i=l 



0, 



(12) 



where = {i/i — h{x^^Pa)} /v{Xail3a) are the Pearson residuals, ipc is the Huber function defined 
by 

r, \r\ < c, 

csign(r), \r\ > c, 



and 



1 " 1 

= - y^w{Xai)Xai—-^jr—-h'{x'^iPa)'E'4'c{rai)- 



i=l 

When w{xi) = 1, the estimator is called the Huber quasi-likelihood estimator. In general we do 
not require that ipc = p' = ip or that Wai = w{xai)- Cantoni and Ronchetti (2001, Appendix 
B) show that the estimator has an asymptotic normal distribution with asymptotic variance 
= M-iQ„M-\ where 

Qc. = -XlAX^-a{l3)a{P)'^ and M„ = 

n n 

with A and B are diagonal matrices with diagonal elements 

1 



wiXaiY 
w{Xai)- 



h'{xlM''EMra^y 



1 



h'ix'^iPaYEraiipciro 



av{xlif3a) 

Using the same generalized inverse as before so that 



Xo^X- = diag(l, . . . , 1, 0, . . . , 0) = En,p^, 
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we have to show that trace(MQ, ^Q^M^ ^Fq,) is monotone in p^- Indeed, 

trace(M-ig^M-ir«) = trace{X- B'^X-^ X^AX^X' B'^X-^ X^WrM 

= tTace{X^X~B~^X-''X^AX^X-B-^X~''X^WrJ 
= trace{En,p^B~^En,p^AEn,p^B-^En,p^WrJ 

1=1 

is a monotone function in pa- This function is monotone in p^ under the same conditions as the 
analogous function for maximum hkehhood estimation. 

3.4 The reduction of models 

For any incorrect model a € ^ \ it follows from condition (vi) in Section 3.1 and from (9) 
that 

liminf M„(a) > lim Mn(ao) a.s. 

n— >oo n— ►oo 

and for any correct model a £ Ac 

Mnia) = -}^Waip{{yi - h{x^{(5a))/di} + — trace(SQ,ra) + Op(m~^) 
1=1 

Hence, it follows that for fixed paj we also have 



liminf min M„(a) > lim maxM„(ao) a.s. (13) 

Equation (13) ensures that backward model selection schemes based on M„(a) maintain 
consistency for the true model if A is the set of all possible 2^°/ submodels. In particular 
we suggest using the following backward selection algorithm if the number of submodels to be 
considered is too large to be dealt with in practical problems. 

Algorithm 3.1. 

1. Calculate M„(a) for the full model a/- = {1, . . . and a/,-i = {1, • • • \ {^}, i = 
1,... ,Paf, resulting in {M„(a) : #a > - 1}. 

2. Set Of = argmin|^Q,>p^^_^| M„(a) and repeat 1. if aj > 2. 

3. Estimate a by the argmin of M„,(a) over all 1 + Yl^=i ^ = 1 + k{k + 1) /2 considered models. 

An example of the solution paths of all submodels and of the backward selected submodels 
is given in Figure 1 in Section 5. 
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4 Simulation study 



In this section we present a range of simulation results for Poisson regression models. The 
proposed robust model selection criterion based on robust and non robust parameter estimator 
procedures with 6 = 2 and 6{n) = 21og(ri) is compared to the AIC and BIC criteria. 
We generated data according to the Poisson regression model 

rji = log fii = (3i+ f32X2i + Psxsi + PiX^, yi ~ Poi{fii), i = l,...,n, (14) 

with true parameter vectors (1,0,0,0), (—1,2,0,0), and (—1,1,1,0) such that YljPj — 1- The 
response variable is Poisson distributed with mean /ij. The explanatory data is generated by 
drawing pseudo-random numbers from the multivariate normal with mean vector (1,1,1) and 
covariance matrix given by diagonal elements equal to 1 and off diagonal elements equal to 0. 

In this non-robust setting we generated for each of the 500 simulation runs n = 64 data points 
and estimated the parameters by (3ml using the glm.f it (ML estimator) and by j3cR using the 
glmrob (Mallows or Huber type robust estimators; see Cantoni and Ronchetti, 2001) function 
in R. We calculated AIC, BIC, and the proposed robust model selection criteria with 8 equally 
sized strata based on the Pearson residuals from the full model, 

The bootstrap estimators for m = 24 are based on i? = 50 bootstrap samples. Selection 
probabilities are presented in Table 1. Note that for 500 simulations the empirical standard 
deviations for the empirical selection probabilities vf are given by sd^ = y^7f(l — 7f)/500 < 0.023. 

Put Table 1 around here. 

In this non-robust simulation the overall performance of the selection criteria S is superior 
to classical criteria such as the AIC and BIC criterion independently of the chosen estimation 
procedure. As an example consider the results for the true parameter vector (1,0, 0, 0) where the 
selection probabilities of the true model using 13ml are 0.58 for AIC, 0.60 for BIC, 0.90 for S, 
and using [3cr the estimated probability is 0.89 for a. 

Next we generated data according to the model in equation (14) but we added 8 moderate 
outliers in the response for the 8 observations with largest explanatory variable x^. That is if 
rank(x4j) := X^^^i l(a^4fc < x^i) > 57 then tn ~ Poi{W), i = l,...,re. All other simulation 
specifications remain the same. The selection probabilities are presented in Table 2. 

Put Table 2 around here. 
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In Table 2 we see that the proposed selection criterion used with the robust estimator from 
Cantoni and Ronchetti (2001) performs outstandingly well. Used with the maximum likelihood 
estimator, it still performs very well compared to AIC and BIC. As an example consider the 
results for the true parameter vector (—1,2,0,0). The selection probabilities of the true model 
using Pml are 0.01 for AIC, 0.01 for BIC, 0.66 for 3, and using (3cr the selection probability 
equals 0.78 for a. 

Finally, we generated data according to the model in equation (14) but we added 2 influential 
outliers in the response variable according to the condition rank(2;4j) < 2 then yi ^ Poi(lOO), 
i = 1, . . . ,n. All other simulation specifications remain the same. The selection probabilities are 
presented in Table 3. 

Put Table 3 around here. 

Table 3 shows that the robust model selection criterion can break down if it is used with (3 ml 
but still perform well with robust parameter estimators. As an example consider the results for 
the true parameter vector (—1, 1, 1,0). The selection probabilities of the true model using (3ml 
equals for AIC, BIC, and a. On the other hand, using (3cr the estimated probability is 0.71 
for a. 

5 Real data example 

In this section we present a real data example on the diversity of arboreal marsupials (pos- 
sums) in the montane ash forest (Australia) which is part of the robustbase package in R 
(possumDiv.rda). The dataset is extensively described by Lindenmayer et al. (1990, 1991) and 
serves as a generalized linear model example with a canonical link function having Poisson dis- 
tributed responses conditional on the linear predictor (Weisberg and Welsh, 1994; Cantoni and 
Ronchetti, 2001). The number of of different species (diversity, count variable, mean = 1.48, 
range = — 5) was observed on n = 151 sites. The explanatory variables describe the sites in 
terms of the number of shrubs (shrubs, count variable, 5.06, — 21), number of cut stumps from 
past logging operations (stumps, count variable, 0.09, — 1), the number of stags (stags, count 
variable, 7.24, — 31), a bark index (bark, ordinal variable, 7.91, — 29), the basal area of acacia 
species (acacia, ordinal variable, 4.83, 0— 10), a habitat score (habitat, ordinal variable, 11.96, 
— 39), the species of eucalypt with the greatest stand basal area (eucalypt, nominal variable, 
three categories), and the aspect of the site (aspect, nominal variable, four categories). We 
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calculate 3 based on Pcr with the same specifications as in the simulation study but because n 
is considerably larger than 64 we choose a smaller proportion for the bootstrap. That is m = 40 
which is about 26% of the sample size. The best model according to our criterion M„(a) includes 
stags and habitat which are also selected if the backward selection algorithm in Section 3.4 
is applied. The solution paths of Mn{oi) is given in Figure 1 which shows the minimal value of 
Mn{oi) for all considered models with the same number of variables. 




Figure 1: Solution path for the minimum of M„(a) given a fixed number of 
non zero slope parameters for all submodels (asm) and for backward selected 
submodels (bsm). 

Cantoni and Ronchetti (2001) mentioned that there are four potentially influential data points, 
namely, observations 59, 110, 133, and 139. According to the results of our simulation study we 
therefore consider S together with Pcr to be superior to AIC, BIC, and S with [3ml- Table 4 
presents an overview of the estimated best model which includes also the results of Cantoni and 
Ronchetti (2001, Section 5.2). 

Put Table 4 around here. 
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6 Discussion and conclusions 



We have proposed a bootstrap criterion for robustly selecting generalized linear models. The cri- 
terion is a generalization of that developed for regression models by Miiller and Welsh (2005) and 
has its strengths while still improving on that criterion. In particular, the criterion (i) combines a 
robust penalised criterion (which reflects goodness-of-fit to the data) with an estimate of a robust 
measure of the conditional expected prediction error (which measures the ability to predict as yet 
unobserved observations), (ii) separates the comparison of models from any particular method of 
estimating them, and (iii) uses the stratified bootstrap to make the criterion more stable. The 
improvement is achieved by using the bootstrap to estimate the bias of the bootstrap estimator 
of the regression parameter and then using the bias-adjusted bootstrap estimator instead of the 
raw bootstrap estimator in the criterion. This step widens the applicability of the method by 
removing the requirement of Miiller and Welsh (2005) that the models under consideration in- 
clude an intercept. We have also developed a more widely applicable method than that given in 
Miiller and Welsh (2005) for establishing that the criterion can be applied with particular robust 
estimators of the regression parameters. Our main theoretical result established the asymptotic 
consistency of the method and the simulation study shows that the model selection method works 
very well in finite samples. 
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Table 1: Estimated selection probabilities based on the maximum-likelihood estimator (3ml and on 
the robust estimator Pcr from Cantoni and Ronchetti (2001). The results are based on 500 Monte 
Carlo simulations and the bootstrap is based on 5 = 50 replications. The data has no outlying points. 
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Table 2: Estimated selection probabilities in the presence of outliers based on the maximum-likelihood 
estimator and on the robust estimator from Cantoni and Ronchetti (2001). The results are based on 
500 Monte Carlo simulations and the bootstrap is based on S = 50 replications. The data has 8 
moderately outlying points. 
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Table 3: Estimated selection probabilities in the presence of outliers based on the maximum-likelihood 
estimator and on the robust estimator from Cantoni and Ronchetti (2001). The results are based on 
500 Monte Carlo simulations and the bootstrap is based on S = 50 replications. The data has 2 
strongly outlying points. 
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Table 4: Estimated best model for the Lindenmayer et al. (1990, 1991) data using a range of model 
selection procedures. 
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